About

Last updated: Fri 08 Feb 2019 09:25

About the Data Science Accelerator

The Data Science Accelerator is a capability-building programme which gives analysts from across the public sector the opportunity to develop their data science skills.

For this project, I will be guided by an experienced data scientist from the Ministry of Justice and based at the GDS hub in Whitechapel once a week between December 2018 and March 2019. As part of this accelerator, I aim to explore data science techniques to produce meaningful and helpful insights both to other teams within the Greater London Authority and to colleagues at participating London boroughs.

About the voter registration project

The main objective of the project is to use machine learning techniques to spot patterns and trends within the data that we would not detect with the more traditional methods usually associated with this kind of data. The focus, for now, is on understanding the data. We hope that this will lead to insights, recommendations, ideas and the confidence to pursue better-informed campaigns to maximise voter registration in the capital.

Please be aware that this project is exploratory in nature, so nothing found on these pages is carved in stone. If you have been sent a link to this page, we assume you have some interest in the work, and we would like to hear your views.

Boroughs included in this analysis

  • Lewisham
  • Brent
  • Wandsworth
  • Waltham Forest
  • Greenwich
  • Lambeth
  • Croydon

Details about this site and different types of analysis

This site is the output of this project. Anything I learn will be added to these pages. If any of it is unclear or difficult to understand, please do get in touch and I will try to update it accordingly. This site is hosted on GitHub and written in RMarkdown, using the flexdashboard and crosstalk packages for functionality.

The site is divided into the following categories.

General information: details about the project, its background, stated aims and its data sources

Information about the data: how I have pulled all of the data together: where it has come from, what I have done to it, and so on.

Clustering: part of the data exploration, where I look for distinct groupings within the data

Predictive modelling: Statistical models that can predict which areas will have lower rates of voter registration. Here I use a particular type of model called a ‘decision tree’ that is very strong at explaining relationships between variables.


Icon made by Freepik from www.flaticon.com

Data


Main sources of data

My starting point for this project, for each borough, is an anonymised list of addresses from the electoral roll. I have merged in other data sets including:

  • ONS population estimates: Mid-year population estimates, that are broken down by output area and age
  • UKBuildings database: This innovative dataset combines categorisations of building type and age derived from aerial satellite photography, supplemented with local authority information.
  • Indices of deprivation: Provided by the Ministry of Housing, Communities and Local Government, this data set provides a range of different indicators of deprivation
  • Census data: Wealth of data going back to the 2011 Census. Of particular importance for this project is data on tenure, education and ethnicity of residents
  • Energy Performance Certificates (EPC)
  • Land Registry
  • Consumer Data Research Centre (UCL): Research centre based at UCL that gathers and produces data on many different topics. For this project, we have used up-to-date data on population churn and the ethnic makeup of neighbourhoods.
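Merging OA-level data sets like those above typically comes down to joining on a shared Output Area code. A minimal sketch in base R, where all table and column names (`oa_code`, `registered_pct`, and so on) are hypothetical stand-ins for the real sources:

```r
# Hypothetical OA-level tables, each keyed by an Output Area code
electoral <- data.frame(oa_code = c("E00001", "E00002"),
                        registered_pct = c(82.5, 74.1))
population <- data.frame(oa_code = c("E00001", "E00002"),
                         pop_estimate = c(512, 487))
deprivation <- data.frame(oa_code = c("E00001", "E00002"),
                          imd_decile = c(3, 7))

# Left-join each source onto the electoral roll base table, keeping
# every OA that appears in the electoral data (all.x = TRUE)
merged <- merge(electoral, population, by = "oa_code", all.x = TRUE)
merged <- merge(merged, deprivation, by = "oa_code", all.x = TRUE)
```

Using `all.x = TRUE` keeps OAs even when a secondary source has no matching row, so gaps show up as `NA` rather than dropped areas.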

Geographic area

The data was aggregated to Census Output Area (OA) level:

  • Typically a population of around 500 people
  • There are 886 OAs in Lewisham

Clustering


Aims

Aims of clustering

Unsupervised learning consists of statistical methods that extract meaning from data without any of the data being categorised beforehand; in this sense, unsupervised learning expects the machine to decide what the categories are. Clustering can be described as ‘the art of finding groups in data’. The main reasons for using this kind of technique are:

  • to identify distinct groups within a data set
  • as an extension of exploratory data analysis
  • to gain insight into the data and how variables relate to one another

Given that OAs are purely administrative inventions, can we see meaningful groups emerge, or is the data too noisy? The aim of this part of the analysis is to run, explain and measure the performance of clustering on our data set. I have focused this part of the analysis on building age and type.

Buildings Data

At the moment, this data is only for Lewisham Council. I have run the clustering algorithm on buildings data, which has been split into categories for type and age.

Type: Flat block | Converted Flats | House (detached/semi-detached) | Terraced house

Age: Victorian/pre-WW1 | Interwar | Postwar (1945-1979) | Modern (1980-)

For each OA, the percentage of addresses that fall into each of the above categories is calculated. The clusters are based on these variables. What we are looking for here is whether or not OAs fit into neat categories of building types.
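The per-OA percentages described above can be computed from address-level data with a contingency table. A minimal base R sketch, with invented `oa_code` and `type` values standing in for the real buildings data:

```r
# Hypothetical address-level data: one row per address, with its OA
# and its building-type category
addresses <- data.frame(
  oa_code = c("E00001", "E00001", "E00001", "E00002", "E00002"),
  type    = c("terraced", "terraced", "flat_block", "flat_block", "converted")
)

# Count addresses per OA and type, then convert each row to percentages
counts <- table(addresses$oa_code, addresses$type)
pct <- prop.table(counts, margin = 1) * 100
```

Each row of `pct` sums to 100, giving the share of each building category within that OA; these rows become the feature vectors fed into the clustering.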

Algorithm

This analysis is carried out using k-means clustering. With this algorithm, we specify the number of clusters, K, in advance; the algorithm then partitions the data into K groups by finding centres that minimise the sum of the squared distances from the data points to those centres.
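In R this is the built-in `kmeans()` function. A minimal sketch on toy data (two well-separated groups of percentage-like values, not the real buildings variables):

```r
set.seed(42)  # k-means depends on random starting centres

# Toy feature matrix: 40 observations of two percentage-style variables,
# deliberately generated as two separated groups
x <- rbind(matrix(rnorm(40, mean = 20, sd = 3), ncol = 2),
           matrix(rnorm(40, mean = 70, sd = 3), ncol = 2))

# K = 2: find 2 centres minimising the within-cluster sum of squared
# distances; nstart = 25 re-runs from several random starts and keeps
# the best solution
km <- kmeans(x, centers = 2, nstart = 25)
table(km$cluster)  # cluster sizes
```

Because the result depends on the random starting centres, using a reasonably large `nstart` (and setting a seed for reproducibility) is standard practice.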

Main findings

Five clusters with different building profiles seemed to emerge from the data, separating out:

  • Victorian terraced housing, often converted into flats
  • High proportions of flat blocks built in the post-war period (1945-79)
  • Terraced houses built between the wars
  • Modern flat blocks
  • Areas with more houses, mostly interwar

The two interwar categories have the highest levels of voter registration. The modern flat blocks have the lowest. The combination of higher-than-average deprivation, a younger population and more renters in these areas is likely to contribute to this.

The clustering here is not as well defined as we thought it might be. We think there are two main reasons for this:

  1. A lot of development has been quite local, so large areas with the same type of housing are not very widespread
  2. OA boundaries are quite arbitrary, drawn up for administrative purposes rather than to reflect any kind of neighbourhood, which makes this geographic unit arguably unsuitable for clustering. Looking at streets might be better.

Frequency of clusters

Labelling the clusters

The algorithm produced the clusters, which I have named according to their characteristics (see the next page to explore them yourself). It should be noted that a variety of different housing can be found in each of the clusters. I have named them according to how prominent different types/ages of housing are compared to their average share.

Frequency

Cluster  Label        Frequency
1        pw_flats     202
2        iw_houses    67
3        mdn_flats    102
4        iw_terraces  131
5        vct_conv     382

Relationship with voter registration

The distribution of voter registration for each of the clusters is shown below. In these box plots the box represents the interquartile range, the line in the middle is the median, and the separate dots are outliers. The plots show that the ‘modern flats’ cluster has the lowest levels and the two interwar clusters have the highest.
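Box plots like these come from base R's `boxplot()` with a formula interface. A sketch with made-up registration rates for three of the cluster labels (the numbers here are illustrative, not the real Lewisham figures):

```r
set.seed(1)

# Invented registration rates for three clusters, 30 OAs each
df <- data.frame(
  cluster  = rep(c("mdn_flats", "iw_houses", "vct_conv"), each = 30),
  reg_rate = c(rnorm(30, 72, 5), rnorm(30, 90, 3), rnorm(30, 82, 4))
)

# One box per cluster: box = interquartile range, middle line = median,
# points beyond the whiskers are drawn as outliers
boxplot(reg_rate ~ cluster, data = df,
        xlab = "Cluster", ylab = "Voter registration (%)")
```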

Cluster visualisation dashboard

Select one of the named clusters to see the distribution of different housing types, housing age, deprivation, average age and renting for each of the clusters.

In the top chart, the percentage of each housing type and age is shown. The colour of each bar shows how the value compares with the borough average: bright red means much lower than average, bright green means much higher, and the closer the colour is to brown, the closer the value is to the average. This lets you read the actual values from the chart while still seeing how each compares with the borough as a whole.

In the bottom chart, the categories have different scales (IMD deciles run from 1 to 10, average age is usually between 30 and 70, and the missing and private-rented variables are percentages). So I have presented each as a scaled value, in standard deviations from the mean.
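This standardisation is base R's `scale()`, which converts each column to z-scores: (value − column mean) / column standard deviation. A sketch with invented values:

```r
# Hypothetical OA-level variables on very different scales
vals <- data.frame(
  imd_decile   = c(1, 4, 8, 10),    # deciles, 1-10
  average_age  = c(34, 41, 55, 62), # years
  private_rent = c(45, 30, 12, 8)   # percentage of addresses
)

# Column-wise z-scores: every variable now has mean 0 and sd 1,
# so they can share one axis
scaled <- scale(vals)
```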

The most common category has the highest number of missing values in the buildings data set. This is largely because houses that have been converted into flats are the least likely to be kept up to date.

Clusters Map

Silhouette

About this chart

This ‘silhouette chart’ represents how well the clustering model works. In very well-clustered data, none of the bars would fall below the red dotted line (and certainly not below the zero line!)

The chart shows that most of the data fits quite well into the five clusters, but a substantial portion does not. As we know, the data is very noisy, and OAs are not ideal units for clustering. The message from this chart is that quite a few areas are so mixed that they simply cannot be categorised easily.
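Silhouette widths can be computed with the cluster package (a recommended package that ships with standard R installations). Each point's width is near 1 when it sits firmly in its cluster, near 0 on a boundary, and negative when it probably belongs elsewhere. A sketch on toy data rather than the real buildings variables:

```r
library(cluster)  # recommended package, included with standard R installs

set.seed(42)
# Two clearly separated toy groups of 20 points each
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(40, mean = 3, sd = 0.5), ncol = 2))
km <- kmeans(x, centers = 2, nstart = 25)

# Silhouette width per point, from cluster assignments and distances
sil <- silhouette(km$cluster, dist(x))
mean(sil[, "sil_width"])  # the average width, typically the dotted line
plot(sil)                 # draws the silhouette chart
```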

What else to look at

Some other things I could look at

  • Different unit of analysis (maybe street, rather than OA?)
  • Different combinations of variables, perhaps incorporating things like average floorspace and tenure

Predictive Modelling


Aims

What I am trying to achieve with this analysis is to produce models that, based on the data we have collected, can accurately predict voter registration at OA level and/or address level.

I will be using three models to do this:

  • Linear/logistic regression
  • Decision tree
  • Random forest

The most interesting outputs from each of these models will be the ways in which they arrive at their predictions, as much as the predictions themselves.

For the regression models, this will mean explaining which variables are most important for predicting voter registration. The decision tree should let me explain combinations of values across different variables, as well as estimate thresholds. The random forest model is built from many decision trees and should be more accurate, though less straightforward to explain.
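In R, the first two models are `lm()` (base) and `rpart()` (a recommended package that ships with R). A sketch on simulated OA-level data, where all variable names (`imd_decile`, `private_rent`, `average_age`, `reg_rate`) are hypothetical stand-ins for the real predictors:

```r
library(rpart)  # recommended package, included with standard R installs

set.seed(7)
n <- 200
# Simulated OA-level predictors and a registration rate that (by
# construction) falls as private renting rises
oa <- data.frame(
  imd_decile   = sample(1:10, n, replace = TRUE),
  private_rent = runif(n, 0, 60),
  average_age  = rnorm(n, 45, 8)
)
oa$reg_rate <- 90 + 0.5 * oa$imd_decile - 0.3 * oa$private_rent +
  rnorm(n, 0, 2)

# Linear regression: coefficients show each variable's contribution
lin <- lm(reg_rate ~ imd_decile + private_rent + average_age, data = oa)

# Decision tree: splits reveal thresholds and variable combinations
tree <- rpart(reg_rate ~ imd_decile + private_rent + average_age, data = oa)

# A random forest (e.g. randomForest::randomForest with the same formula)
# averages many such trees: usually more accurate, less interpretable
```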

Linear regression model

This page is currently work in progress. Something should be here soon!

Regression tree model

This page is currently work in progress. Something should be here soon!

Random forest model

This page is currently work in progress. Something should be here soon!

Tree visualisation

This page is currently work in progress. Something should be here soon!

Comparison of models’ predictions

Software/Code

This has been carried out entirely using the R programming language. For all of the code, please see the GitHub page.

I will include some snippets of code here to explain how I ran some of the models.

Contact/feedback

If you have been sent a link to this site, I am definitely interested in your feedback, especially if anything is unclear or if you have any suggestions as to how I could improve this work.

The easiest way to contact me is by email to joe.heywood@london.gov.uk.


What’s next

This page is currently work in progress. Something should be here soon!